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Abstract 

We consider the problem of estimating the number of components and 
the relevant variables in a multivariate multinomial mixture. This kind of 
models arise in particular when dealing with multilocus genotypic data. 
A new penalized maximum likelihood criterion is proposed, and a non- 
asymptotic oracle inequality is obtained. Further, under weak assump- 
tions on the true probability underlying the observations, the selected 
model is asymptotically consistent. On a practical aspect, the shape of 
our proposed penalty function is defined up to a multiplicative param- 
eter which is calibrated thanks to the slope heuristics, in an automatic 
data-driven procedure. Using simulated data, we found that this proce- 
dure improves the performances of the selection procedure with respect 
to classical criteria such as BIC and AIC. The new criterion gives an 
answer to the question "Which criterion for which sample size?" . 

Keywords: Biostatistics; Latent class model; Multilocus genotypic data; Multi- 
variate multinomial mixture; Penalized Likelihood; Population genetics; Slope 
heuristics; Variables selection 

1 Introduction 

This article is concerned with the unsupervised classification on categorical mul- 
tivariate data. The model-based clustering, which uses finite mixture models, 
is an intuitive and rigorous framework for the unsupervised classification. How- 
ever there is no clear consensus on the way to gather individuals in general: on 
the basis of well separated clusters, or on the basis of the components of the 
mixture distribution? We refer to Baudry [2009] for a general discussion on this 
topic. Finite mixture models are specially adapted when each class is supposed 
to be characterized by a set of parameters, for instance in population genetics: 
in this case the populations that the biologists look for are characterized by their 
allelic frequencies and a genetic equilibrium; this corresponds to the notion of 
population as a reproduction unit, or a group of individuals sharing the same 
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genetic structure. Finite mixture models are also known in the literature as the 
latent class models. 

The observations are n independent realizations of a random vector, whose 
number L of coordinates (variables) may be large. The individuals of the sample 
are clustered into a certain unknown number K of populations on the basis of the 
frequencies of apparition of the possible states of each variable. It may happen 
that only a subset S of the variables are relevant for clustering purposes, and 
the others are just noise. Thus, in addition to the number K of populations 
and the frequencies of the different states, we are also interested in the subset 
S, which may have significance in the interpretation of the results. 

A number of clustering methods for categorical multivariate data have been 
proposed in recent years in the context of genomics (see [Chen et al., 2006, 
Corander et al., 2008, Pritchard et al., 2000]). But the problem of variable selec- 
tion for clustering using such data was first addressed in [Toussilc and Gassiat, 
2009] , where the question is regarded as a model selection problem in a density 
estimation framework. First the components of a finite mixture distribution are 
identified, then the individuals are clustered into these components using the 
Maximum A Posteriori (MAP) method. 

Using simulated data, that article shows that the variable selection procedure 
based on the Baycsian Information Criterion (BIC) significantly improves clus- 
tering and prediction capacities in our framework. It also gives a theoreti- 
cal consistency result: when the true density Pq underlying the observations 
belongs to one of the competing models, then there exists a smallest model 
■M(k , So) containing P ; further, the BIC type criteria select M.(k , s ) w i tn 
probability tending to one as the sample size n goes to infinity. This consis- 
tency approach requires large sample sizes which may be difficult to obtain. 
However the knowledge of the true model, aside the frequencies of the states, is 
an important information for the interpretation of the results. 

In the present paper we adopt an oracle approach. We do not aim at choosing 
the true model underlying the data, even if our procedure performs well also 
for that. The criteria are rather designed to minimize some risk function of 
the estimated density with respect to the true density. In this context simpler 
models can be preferred to Ai^o, s )i m which too many parameters can entail 
estimators which overfit the data. Actually there is no need to assume that Pq 
belongs to one of the competing models M.(k.s)- 

BIC relies on a strong asymptotic assumption, and can thus require large 
sample sizes to reach its asymptotic behavior; practically BIC is known to 
ovcrpenalize, and therefore selects too small models for small or medium values 
of n (see [Nadif and Govaert, 1998]). On the contrary Akaike's Information 
Criterion (AIC) is known to underpenalize, and selects too large models for 
large and medium values of n. We would like a criterion which gathers the 
virtues of both AIC and BIC, and performs well for different values of n. 

In this article, we propose a non asymptotic penalized criterion based on the 
metric entropy theory of Massart (in particular [Massart, 2007]). It leads to a 
non asymptotic oracle inequality, which compares the risk of the selected esti- 
mator to the risk of the estimator associated with the unknown best model (see 
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Theorem 1 below) . There exists a large literature on model selection via penal- 
ization from a non asymptotic perspective. This literature is still in development 
with the appearance of sophisticated tools of probability such as concentration 
and deviations inequalities (see [Massart, 2007] and the references therein). In 
mixture models the non asymptotic approach is very recent, the first related 
work being [Maugis and Michel, 2009] for the Gaussian mixture model. 

However, the obtained penalty function presents drawbacks: it depends on 
a multiplicative constant for which sharp upper bounds are not available, and 
it leads in practice to an ovcrpcnalization — even worse than BIC. Therefore 
our theoretical result mainly suggests the shape of the penalty function: 

pen n (to) = \D m /n, 

where D m is the dimension of model to, and A an unknown parameter de- 
pending on the sample size and the complexity of the collection of models un- 
der competition, which has to be calibrated. A calibration of A with the so- 
called slope heuristics has been proposed in [Birge and Massart, 2007] in such 
a case. We propose a modified version based on a sliding window of this cal- 
ibration method. The resulting criterion does not require an ad-hoc choice of 
the penalty parameters and adapts automatically to the data. Although the 
full theoretical validation of slope heuristics is provided only in the Gaussian 
homoscedastic and heteroscedastic regression frameworks [Arlot and Massart, 
2009, Birge and Massart, 2007], they have been implemented in several other 
frameworks (see [Lcbarbier, 2002, Maugis and Michel, 2008, Verzelen, 2009, 
Villcrs, 2007] for applications in density estimation, genomics, etc.). The sim- 
ulations performed in Subsection 4.3 illustrate that our criterion behaves well 
with respect to more classical criteria as BIC and AIC, both to estimate the 
density, even when n is relatively small, and to retrieve the true model. It can 
be seen as a representative of the family of the General Information Criteria 
(see for instance [Bai ct al., 1999] whose criterion is less intuitive but presents 
some analogy with the slope heuristics) . 

The paper is organized as follows. Section 2 is devoted to the presentation 
of the mixture models framework and to the model selection paradigm. In 
Section 3 we state and prove our main result, the oracle inequality. Section 4 is 
devoted to the practical aspect of our procedure which has been implemented 
in the stand alone software MixMoGenD (Mixture Model using Gcnotypic Data) 
(see [Toussile and Gassiat, 2009]). Results on simulated experiments are also 
presented: we compare our proposed criterion to classical BIC and AIC, in 
both points of view of the selection of the true model and of density estimation. 
Eventually, the Appendices contain several technical results used in the main 
analysis. 
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2 Model and methods 



2.1 Framework 

We suppose we deal with independent and identically distributed (iid) realiza- 
tions of a multivariate random vector X = [X )i<i<l- We consider two main 
settings: 

1. Each X 1 is a multinomial variable taking values in {1, ... , Ai}. 

2. Each X 1 consists in a (non ordered) set {X 1 ' 1 , X 1 ' 2 } of two (that may be 
equal) qualitative variables taking their values in the same set {1, . . . , A{\. 

All along this article, these two settings will be referred to as Case 1 and Case 2. 
In both cases, the numbers Ai of allowed states are supposed to be known, and 
to verify A\ > 2. 

The first case is a usual latent class model with various applications (psy- 
chomatrics, marketing, credit scoring, genomics, etc.), while the last one is 
more specific to genotypic data. In this context X = (X 1 )i<i<l represents the 
genotype of an individual at L loci of its DNA. Case 1 corresponds to haploid 
organisms, with a single representative of each chromosome; at any locus I a 
single allele X 1 is measured. Case 2 corresponds to diploid organisms, with two 
representatives of each chromosome; at any locus I, two alleles X 1 ' 1 and X 1,2 
are observed together. 

We consider a model-based clustering, which means that the sample is a 
finite mixture of an unknown number K of populations (clusters), each being 
characterized by a set of frequencies of the states. Let denote by Z the (unob- 
served) population an individual comes from. Variable Z takes its values in the 
set {1, . . . , K} of the labels of the different clusters. Its distribution is given 
by the vector ir = (irk)i<k<K , where iTk = P(Z = k). Conditionally to Z, the 
variables X 1 , . . . , X L are supposed to be independent. In Case 2, the states 
X L1 and X 1 ' 2 for the I th variable are also supposed to be independent condition- 
ally to Z. The preceding two assumptions are what biologists respectively call 
Linkage Equilibrium (LE) and Hardy- Weinberg Equilibrium (HWE). According 
to these assumptions, the probability distribution of a genotype x = ( ;c ')i<;<l 
in a population k is given in the following equations 

L 

P{x\ Z = k) =]JP(x l \Z = k) 

i=i 

Case 1: P [x \Z = k) = a klx i 

Case 2: P (x l \Z = k) = (2 - 1^,1^1,2) a k ^ x i,ia k ^ x i,2 (1) 

where ctk,l j is the probability of state j associated to variable X 1 in population 
k. The mixing proportions 7Tfc and the probabilities oik,l,j will be treated as 
parameters. 

In the context of genomics, Hardy- Weinberg and linkage equilibria are based 
on several simplifying assumptions that can seem unrealistic; however they have 
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still proven to be useful in describing many population genetic attributes and 
serve as a base model in the development of more realistic models of microevo- 
lution. Further, the choice of estimators derived from the maximum likelihood 
estimator (MLE) responds to the wish of biologists to group the sample into clus- 
ters minimizing the Hardy- Weinberg and linkage desequilibria, and this brings 
some robustness to our modeling (see [Latch ct al., 2006] and references therein). 

Going deeper, the oracle approach emphasizes that we should often prefer 
simplified and misspecified models. This introduces a modeling bias in order 
to get more robust estimators and classifiers, and at the end we get a smaller 
estimation error. This legitimizes also the following simplification. 

It may happen that the structure of interest is contained in only a subset S 
of the L available variables, the others been useless or even harmful to detect a 
reasonable clustering into statistically different populations. For the variables 
in S, the frequencies of the states in at least two populations are different: 
we will call them clustering variables. For the other variables, the states are 
supposed to be equally distributed across the clusters. This approximation is 
theoretically justified by the oracle heuristics, which is able to take advantage of 
the misspecification; the simulations performed in [Toussilc and Gassiat, 2009] 
illustrate its benefits. 

We denote by (3ij the frequency of state j associated to variable X 1 in the whole 
population: 

Pij = otijj = ■■■ = a>k,i,j ■■■ = ax,i,i for an y ' i- & anci 1 - 3 < M- 

Obviously, S = if K = 1, otherwise S belongs to V*(L), the set of all non 
empty subsets of {1,...,L}. 

Summarizing all these assumptions, we can write down the likelihood of an 
observation x = { xl ) 1<l<L : 





l£S 

where 8 = (ir, a, j3) is a multidimensional parameter, with 

a = ( a k,l,j)i<k<K\ IeS; l<j<A, 
P = (Plj)l$S; l<j<A, ■ 

For a given K and S, 9 = 6(k,S) ranges in the set 



II > 



x 



(3) 



L ;es 



its 
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where = jp = (pi, p 2 ,- • • , Pr) G [0, If : Y.]=iPj = l} is the ( r _ !)" 

dimensional simplex. 

Then we consider the collection of all parametric models 

M(if.s) = { p (K,s,e) ■ 9 G Q(k-,j?)} (4) 

with (A,S) G M := {(1,0)} U (N\{0, 1}) x V*(L). To alleviate notations, wc 
will often use the single index m G M instead of (AT, S) . 

Each model M(k,s) corresponds to a particular structure situation with K 
clusters and a subset S of clustering variables. Inferring K and S becomes a 
model selection problem in a density estimation framework. It also leads to a 
data clustering, via the estimation 9 of the parameter d^x,s) an d the prediction 
of the class z of an observation x by the MAP method: 

z = argmaxP,„ „g\ (Z = k\X = x) . 

l<k<K ( ' ' ' 

2.2 Model selection via penalization 

A common method to solve model selection problems consists in the minimiza- 
tion of a penalized maximum likelihood criterion. In each model M.(k,s)i con- 
sider the maximum likelihood estimator (MLE) P{k,s) ~ P {k s §)> wmcri mini- 
mizes the log-likelihood contrast 

n 

ln(P) = — E lnP W ( 5 ) 
i—\ 

where Xi describes the individual i in the sample. Then a data driven selected 
model M(g n £ n ^ is chosen, where , S n ) minimizes a penalized maximum 
likelihood criterion of the form 

crit(A, S) = ln{P {K ,s)) + pen„(A, S), 

where pen„ : M — > R + is the penalty function. Eventually the selected esti- 
mator is g y 

The penalty function is designed to avoid overfit problems. Classical penal- 
ties, such as the ones used in AIC and BIC criteria, are based on the dimension 
of the model. In the following, we will refer to the number of free parameters 

D {KS) =K-l + KY l (A l -l)+J2(A l - 1) (6) 

les i$s 

as the dimension of the model A4(k,S)- The penalty functions of AIC and BIC 
are respectively defined by 

P en Aic ( m ) = - - D m ; 

In & 

P en BIC ( m ) = ~Y~ ■ Dm - 
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Our work is centered on the MLE estimator P(k. s) > but this last one presents 
a drawback. For the sake of density estimation, we would like to use the 
Kullback-Lciblcr divergence KL as a risk function to measure the quality of 
an estimator. Unfortunately, when an state is not present in the sample, the 
MLE estimator assigns to it a zero probability. As a consequence, the Kullback 
risk Ep KL I Pq, P(k, s) ) is infinite. 



The Hellinger distance offers an alternative to the Kullback-Leibler diver- 
gence. Let us consider two probability distribution P and Q, admitting respec- 
tively s and t as density functions with respect to a common cr-finite measure 
/z. We call Hellinger distance between P and Q the quantity h(P, Q) defined by 



h(P,Q) 2 



d/j,(x) 



(8) 



Let (K*, S*) be a minimizer in (K, S) of the Hellinger risk of the MLE 
estimator 



R 



(K, S) 



h 2 Po, P ( 



(K, S) 



(9) 



The density P(K', s*) is called oracle for the Hellinger risk. It is not an es- 
timator, since it depends on the true density Po- However it can be used as 
a benchmark to quantify the quality of our model selection procedure: in the 
simulation performed in paragraph 4.3.2, we compare the Hellinger risk of the 



selected estimator P, 



(K n> s n ) 



to the oracle risk. 



3 New criteria and non asymptotic risk bounds 
3.1 Main result 

Our main theorem provides an oracle inequality for both Case 1 and Case 2. It 
links the Hellinger risk of the selected estimator to the Kullback-Leibler diver- 
gence KL between the true density and each model in the models collection. 
Unlike KL which is not a metric, the Hellinger distance h permits to take 
advantage of the metric properties (metric entropy) of the models. 

Theorem 1. We consider the collection M of models defined above, and a 
corresponding collection of p-MLEs (P(k.s)) ( K s)eM' w ^ c ^ means that for every 
(K, S) e M 

Jn(P(K.S)) < „ jnf Jn{Q)+P- 

Let A max = sup 1<;<i Ai, and let £ be defined by £, = ^ / _ ) _°"' Y" ^ n Case 1 and 

max 



in Case 2. Assume that £ < 1 or n > £ K . 
2(1 + 3V2) L -1 

There exists absolute constants k and C such that whenever 




( ilnn+ ^lnL, ^ +]nL 
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for every (K,S) e M, then the model M^ n> s n ) wflere 



minimizes 



cnt{K, S) = l n (P( K ,s)) + pen n (K, S) 
over M exists and moreover, whatever the underlying probability Pq , 



h 2 (P^P 



<C( inf (KL(P ,M (K , s) )+pen n (K,S))+p 1 



w/iere, /or every (if, S) e M, KL (Po,-M(x,S)) = infQeA4 (A ., s) KL(P ,Q). 

The condition £ < 1 is used in the proof to avoid more complicated calcula- 
tions. In practice, £ is very likely to be smaller than 1 for L not too small. 
Note that as soon as n > 2L, (10) is simplified in the following way 



K,S) 



pen„ (K, S) > k I 5 + ^ ^ In n + ^ In l\ — ^ 
„, , ,. „ , Inn D 

The leading term for large n is k -, which is a multiple of the penalty 

2 n 

function of BIC. As a consequence, we can apply Theorem 2 from [Toussile and Gassiat, 
2009]: when the underlying distribution Pq belongs to one of the competing 
models, the smallest model (Kq,Sq) containing Po is selected with probability 
tending to 1 as n goes to infinity. 

Such a penalty is not surprising in our context: it is in fact very similar to the 
one obtained in [Maugis and Michel, 2009] in a Gaussian mixture framework. 

Sharp estimates of k are not available. Theorem 1 is too conservative in 
practice, and leads to an over-penalized criterion which is outperformed by 
smaller penalties. So it is mainly used to suggest the shape of the penalty 
function 

pen n (*r,S) = A^^ (11) 

where A is a parameter to be chosen depending on n and the collection M — but 
not on (K,S). Slope heuristics [Arlot and Massart, 2009, Birge and Massart, 
2007] can be used in practice to calibrate A: this is done in Section 4, where we 
use change-point detection [Lcbarbicr, 2002] in relation to slope heuristics. 

Since h 2 is upper bounded by 2, the non-asymptotic feature of Theorem 1 is 
interesting when n is large enough with respect to D^x,s)- However, even with 
small values of n, the simulations performed in Subsection 4.3 show that the 
penalized criterion calibrated using the slope heuristics keep good behaviors. 

3.2 A general tool for model selection 

Theorem 1 is obtained from [Massart, 2007, Theorem 7.11]. This last result 
deals with model selection problems by proposing penalty functions related to 



geometrical properties of the models, namely metric entropy with bracketing for 
Hcllinger distance. 

The framework here is the following. We consider some measurable space 
(A, A), and \x a er-finitc positive measure on A. A collection of models {A4 m )meM 
is given, where each model M. m is a set of probability density functions s with 
respect to \i. The following relation permits us to extend the definition of h to 
positive functions s or t whose integral is finite but not necessary 1. Denoting 
■s/s the function defined by y/s(x) = y/ s(x), and by || • H2 the usual norm in 
L 2 (/i), then 

h(s,t) = \\V~s-Vt\\ 2 . 

Let us now recall the definition of metric entropy with bracketing. Consider 
some collection F of measurable functions on A, and d one of the following 
metrics on F: h, || ■ ||i, or || ■ [I2. A bracket [I, u] is the collection of all measurable 
functions / such that I < f < u. Its <i-diameter is the distance d(u,l). Then, 
for every positive number e, we denote by -/V[.](e, F, d) the minimal number of 
brackets with d-diameter not larger than e which are needed to cover F. The 
d-entropy with bracketing of F is defined as the logarithm of iVr.n (er, F, d), and 
is denoted by (e, F, d). 

We assume that for each model Ai m the square entropy with bracketing 
sjHvAe, A4 m ,h) is integrable at 0. Let us consider some function </> m on R+ 
with the following properties 

(I). 4> m is nondecreasing, x i-> <j) m {x)/x is non-increasing on (0, +00) and for 
every a £ R + and every u £ M m 

^H[.] (x,S m (u,a),h)dx < <j> m ((r), 

where S m (u, a) = {t £ M m ■ \\Vt - \fu\[ 2 < a). 

(I) is verified in particular with <p m (a) = f° \/H[-] {x,M. m ,\i)dx. 
In order to avoid measurability problems, we suppose that for each m £ M, 
the following separability condition is verified for M. m : 

(M). There exists some countable subset M! m of M. m and a set A' C A with 
fi(A') = (i(A) such that for every t £ M m , there exists some sequence 
(tk)k>i 01 elements of M' m such that for every x £ A', ln(tk(x)) tends to 
ln(t(x)) as k tends to infinity. 

Theorem 2. Let X\, . . . ,X n be iid random variables with unknown density 
s with respect to some positive measure (i. Let {A^ m } m gM be some at most 
countable collection of models, each fulfilling (M). We consider a corresponding 
collection of p-MLEs (s" m )meM- Let {x m }m£M be some family of nonnegative 
numbers such that 

^ e~ Xm — S < 00, 

m6M 



L 
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and for every m G M considering <j) m with property (i) define o~ m as the unique 
positive solution of the equation 

4>m{a) = V^° 2 - (12) 
Let pen n : M — > IR + and consider the penalized log-likelihood criterion 

crit(m) = 7„ (s m ) + pen„(m). 
Then, there exists some absolute constants k and C such that whenever 
pen n (m) > n ^er^ H —J for every m € M, 

some random variable fh minimizing crit over M exists and moreover, whatever 
the density s 

E s [h 2 (s,s ffi )] <C ( inf (KL (s, 7W m ) + penjm)) + p + - ) . 

ymeM n J 

In Theorem 2, cr^ has the role of a variance term of s m , while the weights 
x m take into account the number of models m having the same dimension. 

3.3 Proof of Theorem 1 

In order to apply Theorem 2, we need to compute the metric entropy with 
bracketing of each model M(k,s)- This is done in the following result, which is 
proved in Appendix A. 

Proposition 1 (Bracketing entropy of a model). Let rj : M. + —> R + be the 

increasing convex function defined by 

Case 1: 77(e) = (1 + e) L+1 - 1. 

Case 2: 77(e) = (1 + e) (1 + s/2 e (2 + e)) L - 1. 

For any choice of K and S, M.ik,s) fulfills (M). For any e € (0, 1), 
ff H (f?(e),M (K ,s),h) < D {K>S) In ( - ) + C (Jf>S) , 



whe 



{KS) =U ln(2ne)D (KtS) + ln(4^e) (1 K >2 + L + (K - 1)\S\) 

L 

+ 1 K > 2 \n(K + 1) + HA + 1) + (K - 1) HA + 1) 



1=1 les 



(13) 



C(k,s) is a technical quantity measuring the complexity of a model M(k,s)- 
In the next step we establish an expression for <f) m . All following results are 
proved in Appendix B. 
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Proposition 2. For any choice of m = (K,S), the function cj> m defined on 

(0,77(1)] by 



<p m (cr) = (2Vhi2 ^D (K ^ S) + y/ ~C(k~s) - D {KyS) In 77 1 (er)) a 
fulfills (I). 

We do not define <f) m for a bigger than rj(\), to avoid more complicated 
expressions. This is why a condition on £ appears in the following lemma: 

Lemma 1. Let A max = sup Ai, £ = ^ " lax _^Z i n Case 1, and £ = 

1<1<L 2 + — 1 

in Case 2. Then, for all n > 1 if £ < 1, and for n > £ 2 K 



2(1 + 3V2) L - 1 
otherwise, the solution a m of (12) verifies a m < rj(l). 

From Proposition 2 we can deduce an upper bound for o~ m , with a similar 
reasoning to [Maugis and Michel, 2009]. First, a m < 77(1) entails n" 1 (a m ) < 1, 
and we obtain the lower bound o~ m > cf m , where 

'2Vln2VA„ + v / CJ. (14) 



This can be used to get an upper bound 
1 



o m <-={ 2 Vln2 VA» + ^C m - D m In r?" 1 (5= TO ) . (15) 



Let us now choose the weights x m . If we take something bigger than tot^, 
this will change the shape of the penalty in Theorem 2. We define 

x m = (ln2)D m . 

The following Lemma shows that this choice is suitable. 

Lemma 2. For any model M m , with to 6 M as above, let us set x m = (In 2)D m . 
Then 

£ e-*« < (3/4) L . 

To express the penalty function we have to lower bound rj^ 1 (a m ). This is 
done in the following Lemma. 

Lemma 3. Using the preceding notations, 



*l + ^ < 5+ Jmax fllnn + ilni,^ +lnL 

n n \ y \2 2 2 

This ends the proof of Theorem 1. 
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4 In practice 



In real datasets the numbers A\ of possible states at each variable X 1 are not 
necessarily known. The numbers Ai of observed states can be used instead. In 
fact, the MLE estimator select a density with null weight on non-observed states. 
Then, in each model M.ik,s)i an approximated MLE estimator can be computed 
thanks to the Expectation-Maximization (EM) algorithm (see [Dempster et al., 
1977]). 

The other two points that have to be done before reaching the final estimator 
P{R s ) are ^ ne choice of the penalty function, and the sub-collection of models 
on which the EM algorithm will be used. These two points are discussed in 
Subsections 4.1 and 4.2. Then simulations are presented in Subsection 4.3. 

4.1 Slope heuristics and Dimension jump 

Theorem 1 suggests to take a penalty function of the shape (11), defined mod- 
ulo a multiplicative parameter A which has to be calibrated. Slope heuris- 
tics, as presented in [Arlot and Massart, 2009, Birge and Massart, 2007], pro- 
vide a practical method to find an optimal penalty pen opt (m) = X op tD m /n. 
These heuristics are based on the conjecture that there exists a minimal penalty 
pen min (m) = A m i„D m /n required for the model selection procedure to work: 
when the penalty is smaller that pen min , the selected model is one of the most 
complex models, and the risk of the selected estimator is large. On the contrary, 
when the penalty is larger than pen min , the complexity of the selected model is 
much smaller. Then the optimal penalty is close to twice the minimal penalty: 

pen opt (m) ps 2X min D m /n. 

The name "slope heuristics" comes from A m j n being the slope of the linear 
regression j n \ P-nij ~ D m /n for a certain sub-collection of the most competing 
models m. For example, on the left panel of Figure 1 below, a slope is visible 
for the models containing the true model M(k , g y Even if this example is 
favorable and mainly here for illustration purposes, it shows that the slope 
heuristics are sensible with the modelings of the present work. 

Instead of estimating A m i n by linear regression, another method is jump de- 
tection. Suppose we have at hand a reasonable grid Ai < . . . < A r of candidate 
values of A m ; n , and a sub-collection M. exp i orec [ of the most competitive models. 
Each Ai leads to a selected model fhi with dimension Df^ . . If you plot Dff li as 
a function of Ai, A TO j„ is expected to lie at the position of the biggest jump. 
However, the right panel of Figure 1 illustrates an important point: in that ex- 
ample the biggest jump is at A ps 5.1, but A m i n is around 0.9, which corresponds 
to several successive jumps. We propose an improved version of the dimension 
jump method of [Arlot and Massart, 2009], based on a sliding window: we con- 
sider at a time all jumps in an window of h > 1 following intervals in the grid. 
Algorithm 1 below describes the procedure. 
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200 400 600 800 1 2 3 4 5 6 7 

Dimension ^ 



Figure 1: Two ways to compute the slope, on a simulated sample of 1000 indi- 
viduals, with 8 clustering variables among 10, and 5 populations. Models have 
been explored via the modified backward-stepwise described in subsection 4.2, 
the number K of clusters varying from 1 to 10. The size of the sliding window 
is 0.15. 



4.2 Sub-collection of models for calibration 

For a given maximum value K max of the number of clusters, the number of 
models under competition is equal to 1 + (K max — 1) * (2 L — l). Since this 
number is huge in most situations, it is very painful to consider all compet- 
ing models for calibration of the parameter A. On the other hand, we need 
enough models to ensure that there is a clear jump in the sequence of selected 
dimension. We consider the modified backward-stepwise algorithm proposed in 
[Toussile and Gassiat, 2009], which explores of cardinalities of S. It enables to 
gather the most competitive models among all possible S for a given number K 
of clusters and a given penalty function pen n . It gives also the choice to add a 
complementary exploration step based on a similarly modified forward strategy. 
We will refer to this algorithm as explorer (K, pen n ). 

Since we do not know the final penalty during the exploration step, we 
consider a reasonable grid 1/2 = Ai < . . . < A r = Inn containing both penalty 
functions associated to AIC and BIC (7). To each value Xi of the grid is 
associated a penalty function pen A .. We launch explorer (K, pen A .) for all 
values of K in {1, ... , K max } and for all values of Aj of the above grid, and we 
gather the explored models in M. exp i ore( i- This sub-collection seemly contains 
the most competitive models and it is then used to calibrate A. 
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Algorithm 1 Calibration of Penalty \M. e xpiored, {K) i= i r , h 

for i = 1 to n\ do 

m 4 <- arg min j -y n (P TO ) + AjD TO /n [ 

end for 

W <- min argmax {D fht _ h - D fflt } 
i£{/i+l,...,r} 

max j^J G [iend — *end — 1] i Df^. — ^m» (Jn(J = ^fni end -h ~ 

return A m i n 



4.3 Numerical experiments 

Our proposed procedure with a data-driven calibration of the penalty function 
has been implemented for Case 2 in the software MixMoGenD (Mixture Model 
using Gcnotypic Data), which already proposed a selection procedure based on 
asymptotic criteria BIC and AIC (see [Toussile and Gassiat, 2009]). Here, we 
conduct numerical experiments on simulated datasets for performances assess- 
ment of the new non asymptotic criterion with respect to BIC and AIC. 

We present two experiments, both in Case 2. The first one considers the 
consistency of the selected model: we study how the procedure retrieves the 
main features of the true model as the number of individuals in the datasets 
increases. In the second one, we are rather interested in a validation of the model 
selection procedure from the oracle point of view: we compare the Hellinger risk 
of the selected estimator to the oracle risk. 



4.3.1 Consistency performances 

In this experiment we consider a setting with L = 10 variables of 10 states 
each. We chose a parameter with K = 5 populations of equal probability. The 
frequencies of the states have been chosen such that the genetic differentiation 
between the populations is decreasing with the variables rank. In the first 6 
variables, the populations are more separated. In the following 2 variables, the 
populations are very poorly differentiated. In the last 2 variables, the states 
follow the same uniform distribution in all populations. The whole parameter 
is available at http://www.math.u-psud.fr/~toussile/. 

We considered different values n of the sample size in [50, 900] and for each 
of them, 10 datasets have been simulated. The results are summarized in Figure 
2. The left panel gives the proportion of selecting the subset S n of clustering 
variables containing the first 6 variables, which are the most genetically differ- 
entiated variables. The right panel gives the proportion of selected models with 
K n = K . 

In this experiment, AIC seems to be the best criterion for variable selection; 
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Figure 2: The figure in the left panel gives the proportion of selected models 
with S„ 2 {1> • • • ) 6}, and the one in the right gives the proportion of selected 
models with K n = K , versus the sample size. 



however the different between AIC and the new criterion is not significant. 
It also appears that AIC estimates the number of clusters better than the 
other criteria for small sample sizes (around n = 100 and n = 200), but it 
overestimates this number from n — 500. On the contrary, the new criterion 
perfectly estimates the number of clusters for sample sizes > 300. BIC performs 
poorly for both variables selection and classification on datascts with small 
sizes. As expected, the data-driven calibration of the penalty function improves 
globally the performances of the selection procedure, and it gives thus an answer 
to the question "Which penalty for which sample size?" . 

It may happen that the results obtained on small sample sizes change a 
little from one run to another. In fact, the EM algorithm can miss the global 
maximum on such sample sizes, in particular in models of higher dimension. In 
our experiments, it is probably the case with some datasets of size n < 300, 
when the number of free parameters in the simulated model is > 310. 



4.3.2 Oracle performances of the estimator 

Since the new criterion is designed in an oracle perspective, it is interesting to 
compare the associated estimator to the oracle for Hellinger risk. Recall that 
the oracle is the estimator associated to the model indexed by the minimizer 



h 2 (P , P ( 



(K, S) 



over the collection of models 



(K*, S*) of the risk: 

In this experiment, we consider simulated datasets with reduced variability 
in order to reduce the computation time. The parameter underlying the data 
admits L = 6 variables, 3 states for each variable, and Kq = 3 populations 
with equal probability. The frequencies of the states have been chosen in such a 
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way that the genetic differentiation between the population is significant on the 
first 3 variables, very small on the 4 th and 5 th variables, while the states of the 
6 th variable follow the uniform distribution in all populations. Thus the true 
model is defined by K = 3 and So = {1, 2, 3, 4, 5}. The whole parameter is 
available at http://www.math.u-psud.fr/~toussile/. 

We estimated the oracle using a Monte Carlo procedure on 100 simulated 
datasets of size 500 each, and got K* = 3 and S* = {1, 2, 3, 4}. The results 
we obtained are summarized in Figure 3 and Table 1. 




Figure 3: The left panel gives the boxplots, means and their 95% confident 
intervals, for — \ — } Kn ' S "V ; the right panel gives the percentages of selection 

h 2 (P , P (if J gSjJ 

of the estimated oracle [k* , S*^j; three criteria have been used: AIC, BIC, 

and Cte*Dim which denotes the new criterion with data-driven calibration of 
the penalty function. 





AIC 


BIC 




AIC 




< 5.40e- 


05 


Cte*Dim 


< 2.02e- 05 


< 2.20e- 
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Table 1: The p- values of pairwise student tests comparing the means of 

the h 2 (^Pq, s \ The alternative hypothesis is that the mean of the 

Hcllingcr distance associated to the criterion in the first column is less than the 
one associated to the criterion in the first line. 



The worst behavior comes from BIC and it is not a surprise for two main 
reasons. First BIC is designed to find the true model which is different to the 
oracle in our experiments. Second, it is based on asymptotic approximation and 
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therefore requires large samples. In contrary, compared to AIC and BIC, the 
new criterion with data-driven calibration of the penalty function is significantly 
the best in the sense of Hellinger risk and the capacity of selecting the oracle. 
Recall that both AIC and the new criterion are designed to find the oracle (see 
Table 1). But like BIC, AIC is based on asymptotic approximations. So the 
advantage of the new criterion over AIC is probably that it is designed in a non 
asymptotic perspective. 

5 Conclusion 

In this paper, we have considered a model selection via penalization, which 
performs simultaneously a variables selection and a detection of the number 
of populations, in the specific framework of multivariate multinomial mixture. 
This leads to a clustering in a second time. Our main result provides an oracle 
inequality, under the condition of some lower bound on the penalty function. 
The weakness of such a result is that the associated penalized criterion is not 
directly usable. Nevertheless, it suggests a shape of the penalty function which 
is of the form pen„(m) = XD m /n, where A = A (n, M) is a parameter which 
depends on the data and the collection of the competing models. In practice A 
is calibrated via the slope heuristics. 

In the simulated experiments we conducted, the new criterion with penalty 
calibration shows good behaviors for density estimation as well as for the selec- 
tion of the true model. It also performs well both when the number of individuals 
is large and when it is reasonnably small. This gives an answer to the question 
"Which criterion for with sample size?" 

In the modeling we considered, the model dimension grows rapidly. In real 
experiments the number of individuals can be small, so other modeling with 
reduced dimension may be needed. We currently work on models which cluster 
the populations differently for each variable, as well as models which allocate 
the same probability to several states. 
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A Metric entropy with bracketing 



We first state several results about the entropy with bracketing, which will be 
used to prove Proposition 1. They are mainly adapted from [Gcnoveve and Wasser 
2000], but several are improved or writen here in a more general form. These 
lemmas can be seen as a toolbox to calculate the metric entropy with bracketing 
of complex models from the metric entropy of simpler elements. 

We consider a measurable space (A, A), and [i a cr-finite positive measure on 
A. We consider a model Ai, which is a set of probability density functions with 
respect to fi. All functions considered in the following will be positive functions 
in L x (jU). 

Lemma 4. Let e > 0. Let [l,u] be a bracket in L 1 (^t), with h-diameter less 
than e, and containing s, a probability density function with respect to ji. Then 



J ldpi<l< J u dfi < (1 + ef 



Proof. First two inequalities arc immediate, from I < s < u. For the last one, 
we use triangle inequality in L 2 (/^), and the definition of h: 




I dfi + h(u, I) 

<(l + e) 2 . 



□ 



Lemma 5 (Bracketing entropy of product densities). Let n > 2, and consider 
a collection (Ai,Ai, fJ-i)i<i< n of measured space. For any 1 < i < n, let Mi be 
a collection of probability density functions on A4 fulfilling (M). Consider the 
product model 

M = {s = <8" =1 Si;Vl < i < n, s t G Mi} . 

A4 contains density functions on A = JX" =1 Ai with respect to fi = ®^ = i^i- 
M. fulfills (M) and, for any sequence of positive numbers (Si)i<i< n , if e > 

nr=i( i +^)- 1 then 

n 
i=l 

Proof. Let us consider some s = <8>™ =1 Si in Ai. For 1 < i < n, let Ai' i: A\ and 
a sequence (tj,fc)fe>i be such as needed for Mj to verify (M). Then, with the 
choice t k = ®™ =1 k,k and A' = J]™ =1 A' i: (M) is true for M too. 
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Let S > 0. For any 1 < i < n, let [U,Ui\ a bracket containing Si, with 
h-diameter less than Si. Let us set I = <g>™ =1 ^, and w = C3>f =1 u;. Then s belongs 
to bracket [l,u]. We can compute its h-diameter: 



r / 71 / n J n 

x me n^-n^n < 

\ JA V j = l V i=l i=j i=l i=j + l 

n j—1 r~, n r~, 



< 

j=\ i=l V 
n n n 

n (i+^)=n( i+ ^)- 1 

j = l i=j + l 3 = 1 

thanks to triangle inequality and Lemma 4 (empty products equal 1). 

Let e > n"=i(l + &i) — 1- For an y 1 — * — n consider a minimal covering of 
M.i with brackets of h-diameter less than Si. With the previous process we can 
build a covering of A4 with brackets of h-diameter less than e. So the minimal 
cardinality of such a covering verifies 

n 

N U (e,M,h)< IJJV H (^Mi.h). 

i=l 

□ 

Lemma 6 (Bracketing entropy of mixture densities). Let n > 2, and for any 

1 < i < ??, /e£ Al; 6e a se£ o/ probability density functions, all on the same 
measured space (A, A, /i) and fulfilling (M) . Let us consider the set of all mixture 
densities 

M = nisi : 7r = (7ri)i<i< n G S n _i;Vl <i<n,Si G Mi j . 
TTien .M fulfills (M), cmc? /or any (5 > 0, ?y > 0, tmd e > o" + 77 + £77, 
#[.] (e, M, h) < ff H (5, S„_i, h) + Y; #[•] (77, M u h) . 

i=l 

Proof. First, let us note that §n_i is separable for its usual topology. Then, 
checking that M fulfills (M) is easy, and we do not explicit it. 

We do not develop either the proof of the last relation, because it is exactly 
the same as in [Genoveve and Wasserman, 2000, proof of Theorem 2]. Let us just 
say that at the end we get, using our Lemma 4 instead of [Genoveve and Wasserman, 
2000, Lemma 3], 

h 2 (l,u) < rf (1 + 5f + 5 2 + 27/5 (1 + S) 



<e 2 . 



□ 
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Next result is just Lemma 2 from [Genoveve and Wasserman, 2000]: 

Lemma 7 (Bracketing entropy of the simplex). Let n > 2 be an integer. Let /i 
be the counting measure on {1, ... ,n}. We identify any probability on {1, . . . , n} 
with its density s G § n _i urai/i respect to fj,. Then, ifO < 5 < 1, 



H V] (<5,S n _i,h) < (n-l)ln 



In 2 + ln(n + 1) + n ln(2?re) 



To deal with Case 2, we also need the metric entropy of the collection of all 
Hardy- Weinberg genotype distributions for a given variable. 

Lemma 8 (Bracketing entropy of Hardy- Weinberg genotype distributions). 

Suppose that, for some variable I, there exist A\ > 2 different states. Let 0; 
be the collection of all genotype distributions following Hardy- Weinberg model 
(1 ). Then Q, t fulfills (M), and for any 8 > and e > V2 8 (2 + 8), 

H V] (e,n h h) <H V] (S,S Al ^,h). 

Proof. (1) permits to associate a parameter a = {ax, . . . , a. At) € to any 

density in f2;. More generally, for any a G [0, 1} A ' , we define a function 

d(xy%) (2 1^1=3:2) ^1^3:2 

on the set of all genotypes x = {a; 1 ,^ 2 } on Ai states. Consider some 8 > and 
d a e Oj. Let [l,u] be some bracket containing a, with h-diameter less than 8. 
Then d a belongs to the bracket [di,d u \. Let us calculate its diameter. 



( y/2u a u b - \Jll a k 



l<a<b<A, 



h 2 (d h d u ) =y^(u a - l a ) 

a=l 

At A t 

< 2 ^2 X] (V u a u b - V u Jb + y/ujb - y/lJbj 



a=l 6=1 



< 2 



\ a=l 



Ub — y/k 



6=1 



\ a=l 



6=1 



<2((l + 8)S + Sy 



using Lemma 4. So h(dz, d„) < \/2 8 (2 + <5). 

Let (<*(*)) fc>i a sequence of elements of H Q" 4 ', which tends to a for 

the usual topology as k tends to infinity. Then, for any genotype x = {a; 1 ,^ 2 }, 
\nd a {k)(x) tends to ln<i Q (a;). Therefore £7; fulfills (M). □ 

Proof of Proposition 1. We build the proof for Case 2. For Case 1 everything is 
similar, with a simplification: we directly have S^-i instead of f2;. 

Using (2) we see that a probability P(k,S){ ' 1^) i s the product of a mixture 
density corresponding to the variables in S, and a product density in fiz 
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for the other variables. Let us call M. the collection of all mixtures of K densities 

Wc first deal with the non clustering variables. Using Lemma 5 and Lemma 8, 
(g)^ 5 0; fulfills (M). For any e € (0, 1), 

ff H ( (1 + 2V2e + V2e 2 ) l -^ - 1, (g) 0;, h J <J2 H l} (2^2e + V2e 2 , 0;, h) 
V ifs / i$s 

<J2 H ll (e.S^-i.h). 

On the same way 

V les J les 

We can apply Lemma 6, and get that M. fulfills (M) and 

((l + 2\/2e + V2£ 2 ) |s| (l+e)-l,7W,h) 

< l K >2H[-] (e,§ K -i,h) + Kj2 H U (e,S^_i,h). 

les 

Lemma 5 again, applied to M and gives that J^A^k.s) 

fulfills (M), 

and for any e € (0, 1), 

H H (?](s),M {K ,s),h) 
< 1k> 2 H[.] (e, Sjr_i, h) + if ^ ff H (e, Sa,-i, h) + ^ #[.] (e, Sa,-i, h) . 

At this point, it only remains to use Lemma 7 and to compute the constants. 

□ 

B Establishing the penalty 

First, we need to establish some properties of function 77. 

Lemma 9 (Properties of function 77). We consider the function rj defined in 
Proposition 1, from R + into R + . 7/ is nonnegative, increasing and convex. 
7/(0) = 0, and t/(0) = L + 1 in Case 1 while r/(0) = 2^2 L + 1 in Case 2. 

Proof. The proof in Case 1 is immediate, so we develop only Case 2. 

Setting u(x) = 1 + 2\/2x + V2x 2 , we can write rj(x) = (1 + x) u{x) L — 1. 
Then, calculus gives 

rf{x) = (2L + 1) u(x) L + 2L(V2- 1) u^x)^ 1 . 
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Since u is positive on (0, +00), 77 is increasing. But 77(0) = 0, so rj is nonnegative 
on R+. We also have rf(0) = 2^/2 L + 1. Next, 

rf'{x) = 2V2(1 +x) {(2L 2 + L)u(x) L - 1 + 2L(L- 1) (y/2 - 1) u(x) L - 2 ^ 

which is positive on R + . □ 

Proof of Proposition 2. Let < a < 77(1), and <5 = rf~ (cr). Then, for any 



^ #[.] (x, A^ m (w, cr), h)cfe 

00 /.r;(2~ 3 + 1 <5) 

^ ^ / A/i^.] (z,A4 m ,h)da; 

r~; Jv(2-is) v 



OO 

< ^ (r? (2^'+M) - 77 (2"^)) VC m - D m hi6 + D mJ \n2 

3=1 



+ VAn In 2 £ Vj (r? (2^ +1 c7) - r? (2^'t5)) . 
3=1 

We deal with the last term of this sum in the following way: 

oc 00 

V] (V - 77 (2-^)) < J2 3 ( 2 ~ 3+1S ) - V ( 2 ~ JS )) 

3=1 3=1 



00 



k=l 



So 



y/ H[.] (x, .M m (u, cr), h)da; < m (er). 



Since 77 is increasing, (j) m (x)/x is decreasing. To check that 4> m is nondecreasing, 
it is enough to prove that function fix) = x-^Jb — lnr7 _1 (a;) is nondecreasing on 

(0,77(1)], where b = From (13), we get C m > ^p±D m > D m , so b > 1. 
Calculus gives 

f'(x) = y/b-lnrrHx)- 



2rr 1 (x) ?/ (ir 1 ^)) y/b-]nr)- 1 (x) 
Let y G (0,1]. 77 is convex on (0, 1], and that entails J 1 ^^ < 1- Thus 



y/b-hy f (V(V)) > b - In y - 1/2 > 0. 

□ 
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Proof of Lemma 1. Since 4>m{x)/x is non-increasing, for any a > such that 
\fno- 2 > 4> m (cr), a > a m . So, we look for situations such that Jn > -^-tttt— ■ 

n (i) 

For all 1 < / < L, A x > 2. Since § ln(l + x) < x - 1 for x > 2, we get the 
following bounds 

l + \n(2*) Dm < Cm <( 2 + ln(27r) + ln2\ ^ (lg) 



2 



Therefore 



7,2(1) 77(1) 

On the other hand, we have 

D m K L A ma:x . 

So, since <p m {x)/x 2 is decreasing, er m < 77(1) as soon as n > £ 2 K. This is true 
when £ < 1, since K < n: the number of clusters is not bigger than the number 
of individuals. □ 

Proof of Lemma 2. We define 5 = 1/2, from which e~ Xm = S Dm . If we consider 
the collection M, we can discern two cases: K = 1 and S = 0, or K > 2 and 
5^0. So, using (6), 

£ e -*« = 5 Ef =1 (^-i) ( 1 + £ £ (*i+E, 6S (*-i)^ A ' 

= ^,-i)f 1 + r ggf^ ' 



<S L 



( 1 + rrjS 4ls| ) 



^(1 + <5) J 



□ 



Proof of Lemma 3. rj 1 in concave and nondecreasing, 77(0) = 0, so for any 
< x < 77(1), 

r] {x) > min(x,2). 
On the other hand (14) and (16) entail 



n V n 



^ m >cJ—>cJ- (17) 
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where d = 2x/Iri2 + J 1+1 " (27r) > 2^2. Therefore 



2 

,-1 



-Inn -1 (^m) < - In f - ^ j - In 2 + max ^0, ^ (Inn - lnL - ln2)^ 

Consider Case 1. Since 77 is a convex function and rf(0) = L + 1, 

2 



^ X (2)< 



L + l 



Now, 



Then 



2 ^ L+1 



L + l V L + l 



^ 1 (2) > 2/{L + l) > 



1< e 2 - 1. 



Therefore 



and 



2 - »j(2/(L + l)) - ( e 2-l)(L + l)" 
-In ( 11 } 2 ^ \ < ln(e 2 - 1) - In 2 + InL + ln(3/2) 



Imf 1 (a m ) < ln(e 2 - 1) - - In 2 + In 3 + max — In n H — In L, 1- In L 

Using now (15), we get 



n n 



— <— U+( 2 ^2+\/ 2 + ln (2 7 r) + ^-ln ?r i(? m ) 



Dm I 1 



< — [—= + 2Vln2 + ^2 + ln(27r) - 3 In 2 + In 3 + ln(e 2 - 1) 



n 



V2 



. ' In n + In L In 2 
' max I , — + In L 



D m / /lnn + lni In 2 
<— 5 + Wmax , — +lnL 



Next, consider Case 2, and follow the same method. Then 



V2L 
and 

"(7Il) s2oxl> ( 2 + 71 
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This leads to 

- In if 1 (x) < 2 + + + In L - In min(x, 2) 

V 2 2 

and 

— In tj^ 1 (<r m ) < 2 H — p + max I - Inn + - In L, — — h In L I 
v2 \2 2 2 J 

Now wc obtain 



<^(^ + 2^ + J4 + ln(2.) + ^ + ln2 



n n \y/2 V 2 

\ 2 



1 max 



In n + In L In 2 



^ Am / _ . / / In n + In L In 2 



ra I \ V 2 '2 
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